Data Exploration (Part 2)

This is the second part of the Data Exploration series. Part 1 can be found here. In this part, we dig further into the data set to make informed decisions about the features we need to engineer to classify human activities accurately.

We knew we would need some frequency domain statistics of the signals, but we have not developed any yet. Let's pick up from there.

We need 128-point FFTs of all the signal blocks we analyzed in part 1. We are primarily interested in the signal energy in each bin, so we look at the power in each bin rather than the complex FFT values. Furthermore, we care about the percentage of total power in each bin, not the absolute power. The file fft_energy_percent.csv contains this information. We read that file in and set a hierarchical index (a.k.a. multi-index) on the DataFrame. Hierarchical indexing and groupby operations will be very useful, as we will see soon.
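
As a rough illustration of what each row of that file represents, the sketch below computes the fraction of total power per bin for a single 128-sample block with NumPy. The function name `energy_percent` and the synthetic test block are assumptions made for illustration only; the script that actually produced fft_energy_percent.csv is not shown here and may differ.

```python
import numpy as np

def energy_percent(block):
    """Hypothetical helper: fraction of total power in each FFT bin of one block."""
    spectrum = np.fft.fft(block)      # complex FFT, one bin per input sample
    power = np.abs(spectrum) ** 2     # power in each bin
    return power / power.sum()        # fractions that sum to 1 (block must not be all zeros)

# Synthetic block (an assumption): 128 samples at 50 Hz, constant offset plus a 5 Hz tone
block_sz, sampling_rate = 128, 50
t = np.arange(block_sz) / float(sampling_rate)
block = 1.0 + 0.1 * np.sin(2 * np.pi * 5.0 * t)

fractions = energy_percent(block)
print(fractions.sum())      # total of all fractions
print(fractions.argmax())   # index of the bin holding the most power
```

Because the constant offset dominates this synthetic block, the strongest bin is bin 0, which mirrors what we expect from the "DC" signals.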

1. Frequency Domain Statistics

Customary imports and setup

In [1]:
# Bulk of the analysis is in Pandas and Numpy
import pandas as pd
import numpy as np

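# Plotly is the graphing library
# Note: plotly.plotly exists only in Plotly versions before 4.0; later versions moved it to the chart_studio package
import plotly.plotly as py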
import plotly.offline as pyoffline
import plotly.tools as tls
import plotly.graph_objs as go

from plot_utilities import *

import sys

# Set Plotly to offline mode to render graphs within this notebook
pyoffline.init_notebook_mode()

# Signals are broken down into 128 sample blocks. i.e 2.56 seconds
# 128 samples per block
block_sz = 128
sampling_rate = 50

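# Time index in seconds for one block: 0 to (block_sz-1)/sampling_rate
index_in_sec = np.arange(block_sz) / float(sampling_rate)

# Index for frequency plots in Hz: bins 0..64 span 0 to Fs/2, i.e. 0 to 25 Hz
freq_index = np.linspace(0, sampling_rate / 2.0, num=block_sz // 2 + 1)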

labels = pd.read_csv('../Data/activity_labels.txt',names=['activity','label_name'],sep=r'\s+')
labels = labels.set_index('activity')
In [2]:
fft_energy_percent = pd.read_csv('../Data/FilteredData/fft_energy_percent.csv')

#  Use Multi indexing (i.e hierarchical indexing) to make things easier
fft_energy_percent.set_index(['label','block_idx','bin_idx'],inplace=True,drop=True)

#  Convert fractions (0 to 1) to percentages (0 to 100)
fft_energy_percent *= 100

We can now see what this frequency domain data looks like. As expected, for the "DC" signals most of the energy is in bin 0. The subplot below shows the spectrum of the z-direction acceleration signal for one signal block. It is symmetric, as we would expect for a real-valued signal, so we can ignore the second half of each FFT without losing information. Also note that while the general shape of these spectra looks similar, the details vary: for example, the heights of the peaks and the locations of the secondary peaks differ.
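
The symmetry claim is easy to check numerically. The sketch below uses a random real-valued block (synthetic data, not taken from the data set) and confirms that the power in bin k equals the power in bin N-k, so bins 0 through N/2 carry all the information.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 128
block = rng.randn(n)                      # any real-valued signal block

power = np.abs(np.fft.fft(block)) ** 2    # two-sided power spectrum

# For real input, bin k and bin n-k are complex conjugates, so their power matches
k = np.arange(1, n // 2)
mirror_matches = np.allclose(power[k], power[n - k])

# np.fft.rfft returns only the non-redundant bins 0..n/2 (65 bins for n = 128)
one_sided = np.abs(np.fft.rfft(block)) ** 2
same_as_first_half = np.allclose(one_sided, power[: n // 2 + 1])

print(mirror_matches, same_as_first_half, one_sided.shape)
```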

To see if there are any systematic differences in the power profile between activities, we can average the spectrum over several blocks. This gives an estimate of the power spectrum for each activity, and we can then compare these estimates against each other, which we will do next.

In [3]:
fft_energy_percent.head(2)
Out[3]:
                          ac_acc_x   ac_acc_y   ac_acc_z   dc_acc_x   dc_acc_y   dc_acc_z    mag_acc     gyro_x     gyro_y     gyro_z   mag_gyro
label block_idx bin_idx
5     0         0        25.377169  16.542831  28.824037  99.999212  99.943381  97.938984  99.999433  35.026498  23.391055  15.856708  76.212384
                1        14.375878  11.668840  20.950104   0.000264   0.021343   0.771215   0.000001   4.238838   4.230348   6.281836   1.361649
In [4]:
plot_fft_energy_percent(fft_energy_percent)

Next we average over all blocks for each activity. This is where groupby combined with hierarchical indexing shows its full power.
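
To see what grouping on index levels does, here is a minimal sketch on a toy DataFrame (made-up numbers and a hypothetical column name `sig`, with the same three index levels as our data). Grouping on `label` and `bin_idx` averages over the level that is left out, `block_idx`.

```python
import numpy as np
import pandas as pd

# Toy frame: 2 activities x 3 blocks x 4 bins, one signal column
index = pd.MultiIndex.from_product(
    [[1, 2], [0, 1, 2], [0, 1, 2, 3]],
    names=['label', 'block_idx', 'bin_idx'])
toy = pd.DataFrame({'sig': np.arange(24, dtype=float)}, index=index)

# Grouping on two of the three index levels averages over the omitted one (block_idx)
per_activity = toy.groupby(level=['label', 'bin_idx']).mean()
print(per_activity)
```

The result has one row per (label, bin_idx) pair, which is the same shape of reduction the next cell applies to the real data.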

In [5]:
fft_engy_percent_estimate = fft_energy_percent.groupby(level=['label','bin_idx']).mean()

# Keep bins 0..64 (0 to Fs/2); bins 65..127 mirror them for real signals
drop_indices = list(np.arange(65,128))
one_sided_fft_engy = fft_engy_percent_estimate.drop(drop_indices,level='bin_idx')

# Delete some objects to free memory
del fft_energy_percent
del fft_engy_percent_estimate

one_sided_fft_engy.tail(2)
Out[5]:
               ac_acc_x  ac_acc_y  ac_acc_z  dc_acc_x  dc_acc_y  dc_acc_z   mag_acc    gyro_x    gyro_y    gyro_z  mag_gyro
label bin_idx
12    63       0.001964  0.004727  0.002855  0.006676  0.000856  0.002539  0.000091  0.001611  0.001931  0.002958  0.003310
      64       0.001962  0.004721  0.002852  0.006672  0.000856  0.002538  0.000092  0.001610  0.001931  0.002956  0.003488

Plot the one-sided spectrum estimate for one of the signals.

In [6]:
idx_range = list(range(65))
signal = 'ac_acc_z'

plot_one_sided_fft(one_sided_fft_engy,signal,idx_range,labels,freq_index)

We can see from the above plot that most of the signal energy lies below 10 Hz. We can confirm this for other signals as well. Moreover, most of the differences in energy profile between activities lie in the first 5 Hz. With a bin spacing of 50/128, or about 0.39 Hz, the first 12 bins cover 0 to about 4.3 Hz, so we can add the first 12 bins of every signal to the feature vector.
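
The mapping from bin index to frequency behind these numbers can be sketched as follows. Bin k of an N-point FFT sampled at Fs corresponds to k*Fs/N Hz; the variable names are chosen for illustration.

```python
import numpy as np

block_sz = 128         # FFT length
sampling_rate = 50.0   # Hz

bin_width = sampling_rate / block_sz                    # 0.390625 Hz per bin
bin_freqs = np.arange(block_sz // 2 + 1) * bin_width    # bins 0..64 -> 0..25 Hz

# NumPy provides the same one-sided frequency axis directly
assert np.allclose(bin_freqs, np.fft.rfftfreq(block_sz, d=1.0 / sampling_rate))

first_12 = bin_freqs[:12]
print(first_12[-1])                  # 4.296875, the highest frequency among the first 12 bins
print(np.sum(bin_freqs <= 10.0))     # 26 bins lie at or below 10 Hz
```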

A zoomed version of the above plot follows. The peak heights and peak locations differ between some pairs of activities.

In [7]:
idx_range = list(range(12))
signal = 'ac_acc_z'

plot_one_sided_fft(one_sided_fft_engy,signal,idx_range,labels,freq_index)
In [8]:
#  Delete to save mem
del one_sided_fft_engy

2. Scatter Plots

Combining the temporal features discussed in part 1 and the frequency domain features discussed in this part, we generated composite feature vectors for each sample block. We will take a look at those below.
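
To give a feel for what such a composite vector might contain for a single signal, here is a minimal sketch. The helper `block_features`, the choice of statistics, and the synthetic block are assumptions for illustration; the features stored in subset_feature_scaled.csv were generated separately and are also scaled, which this sketch does not attempt.

```python
import numpy as np

def block_features(block, n_bins=12):
    """Hypothetical helper: temporal statistics plus the first n_bins FFT energy fractions."""
    block = np.asarray(block, dtype=float)
    temporal = np.array([
        block.mean(),                    # mean
        np.sqrt(np.mean(block ** 2)),    # RMS
        block.std(),                     # standard deviation
    ])
    power = np.abs(np.fft.fft(block)) ** 2
    spectral = (power / power.sum())[:n_bins]   # fraction of total power in the low bins
    return np.concatenate([temporal, spectral])

# Synthetic 128-sample block at 50 Hz (an assumption): offset plus a 2 Hz oscillation
t = np.arange(128) / 50.0
block = 1.0 + 0.3 * np.sin(2 * np.pi * 2.0 * t)

features = block_features(block)
print(features.shape)   # 3 temporal values followed by 12 frequency values
```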

In [9]:
features_f = "../Data/FilteredData/subset_feature_scaled.csv"
labels_f = "../Data/FilteredData/subset_sample_labels.csv"

features = pd.read_csv(features_f)
labels_series = pd.read_csv(labels_f,header=None)
labels_series.columns = ['label']
features_df = pd.concat([features,labels_series],axis=1)
del features
del labels_series

We can plot combinations of features to see if there is a decent separation between different activities. Recall that it was not easy to see obvious differences between the signals for walking, walking upstairs and walking downstairs. Let's see if the full set of features shows any promise.

Some temporal statistics (mean, RMS and STD of the magnitude of the acceleration signal) for Walking and Walking Downstairs are shown below. It is quite clear that a separating hyperplane exists between the two. (Rotate the picture to see it.)

In [10]:
plot_3d_scatter(features_df,labels,activities=[1,3],signals=['mean_mag_acc','rms_mag_acc','std_mag_acc'])

Let's look at the same features for Walking and Walking Upstairs. Here the separation isn't as obvious, but we can try other combinations of features. Remember, we need at least one dimension in which the activities can be separated.

In [11]:
plot_3d_scatter(features_df,labels,activities=[1,2],signals=['mean_mag_acc','rms_mag_acc','std_mag_acc'])

Below is a plot of the first three frequency domain features of the gyroscope y-axis signal for the same activities. We still don't see a clear separation, but at least the blue points appear tightly clustered while the red ones are scattered around. It should be possible to do a decent job with a nonlinear decision boundary that encloses the blue points.
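
To make the idea of a boundary that encloses a tight cluster concrete, here is a minimal sketch on synthetic 3-D points (not the actual gyroscope features). It assumes one class is tightly clustered and the other is widely spread, and it classifies by distance from the tight cluster's centroid. The radius rule (mean plus three standard deviations of the tight class's distances) is an arbitrary choice for illustration, not a recommended classifier.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic stand-ins: a tight cluster ("blue") and a widely scattered class ("red")
blue = rng.normal(loc=0.0, scale=0.2, size=(200, 3))
red = rng.normal(loc=0.0, scale=2.0, size=(200, 3))

# A sphere around the blue centroid acts as a nonlinear decision boundary
centroid = blue.mean(axis=0)
blue_dist = np.linalg.norm(blue - centroid, axis=1)
red_dist = np.linalg.norm(red - centroid, axis=1)
radius = blue_dist.mean() + 3 * blue_dist.std()

blue_inside = np.mean(blue_dist <= radius)   # fraction of blue points inside the sphere
red_outside = np.mean(red_dist > radius)     # fraction of red points outside the sphere
print(blue_inside, red_outside)
```

No single plane can separate these two synthetic classes, since both are centered at the same point, yet a simple distance threshold separates most of them.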

In [12]:
plot_3d_scatter(features_df,labels,activities=[1,2],signals=['gyro_y_0','gyro_y_1','gyro_y_2'])

3. Conclusion

In this part we explored some frequency domain characteristics of the data. We looked at how the combined feature set, which includes temporal and frequency domain statistics of the signals, can help separate human activities. The separation is linear in some cases and nonlinear in others, so we know we have to choose a nonlinear classifier for this data set. In the next part, we will apply some classifiers to the data set and see how they perform.
